
327 research outputs found

Integrity Constraints Revisited: From Exact to Approximate Implication

Authors: Kenig Batya, Suciu Dan
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 23rd International Conference on Database Theory (ICDT 2020)
Publication date: 01/01/2020

Integrity constraints such as functional dependencies (FDs) and multivalued dependencies (MVDs) are fundamental in database schema design. Likewise, probabilistic conditional independences (CIs) are crucial for reasoning about multivariate probability distributions. The implication problem studies whether a set of constraints (antecedents) implies another constraint (consequent), and has been investigated in both the database and the AI literature, under the assumption that all constraints hold exactly. However, many applications today consider constraints that hold only approximately. In this paper we define an approximate implication as a linear inequality between the degree of satisfaction of the antecedents and that of the consequent, and we study the relaxation problem: when does an exact implication relax to an approximate implication? We use information theory to define the degree of satisfaction, and prove several results. First, we show that any implication from a set of data dependencies (MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most quadratic in the number of variables; when the consequent is an FD, the factor can be reduced to 1. Second, we prove that there exists an implication between CIs that does not admit any relaxation; however, we prove that every implication between CIs relaxes "in the limit". Finally, we show that the implication problem for differential constraints in market basket analysis also admits a relaxation with a factor equal to 1. Our results recover, and sometimes extend, several previously known results about the implication problem: implication of MVDs can be checked by considering only 2-tuple relations, and the implication of differential constraints for frequent item sets can be checked by considering only databases containing a single transaction.
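To make the information-theoretic "degree of satisfaction" concrete, the following is a minimal sketch, assuming the usual reading in which a relation is treated as a uniform distribution over its tuples, an FD X -> Y is measured by the conditional entropy h(Y|X), and an MVD X ->> Y by the conditional mutual information I(Y;Z|X). The function names and the sample relation are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2

def entropy(rows, attrs):
    """Shannon entropy (in bits) of the projection of `rows` onto `attrs`,
    taking the uniform distribution over the tuples of the relation."""
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def fd_violation(rows, lhs, rhs):
    """h(rhs | lhs) = h(lhs + rhs) - h(lhs); zero exactly when lhs -> rhs holds."""
    return entropy(rows, list(lhs) + list(rhs)) - entropy(rows, list(lhs))

def mvd_violation(rows, x, y, z):
    """I(y; z | x) = h(xy) + h(xz) - h(xyz) - h(x), for x, y, z partitioning the attributes."""
    x, y, z = list(x), list(y), list(z)
    return (entropy(rows, x + y) + entropy(rows, x + z)
            - entropy(rows, x + y + z) - entropy(rows, x))

# Hypothetical relation in which A -> B is violated (A=2 maps to both y and z).
rows = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 1, "B": "x", "C": 20},
    {"A": 2, "B": "y", "C": 10},
    {"A": 2, "B": "z", "C": 10},
]

# Relation in which A ->> B holds: B and C vary independently for the single A value.
cross = [{"A": 1, "B": b, "C": c} for b in ("b1", "b2") for c in ("c1", "c2")]
```

On this relation, `fd_violation(rows, ["A"], ["B"])` is positive while `fd_violation(rows, ["B"], ["A"])` is zero. The exact implication {A -> B, B -> C} implies A -> C relaxes with factor 1 to the Shannon inequality h(C|A) <= h(B|A) + h(C|B), which can be checked numerically with the same helper.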

Dagstuhl Research Online Publication Server

Integrity Constraints Revisited: From Exact to Approximate Implication

Authors: Kenig Batya, Suciu Dan
Publication date: 03/04/2019

arXiv.org e-Print Archive

Episciences.org

A Dichotomy on the Complexity of Consistent Query Answering for Atoms with Simple Keys

Authors: Koutris Paraschos, Suciu Dan
Publication date: 15/01/2014

We study the problem of consistent query answering under primary key violations. In this setting, the relations in a database violate the key constraints and we are interested in maximal subsets of the database that satisfy the constraints, which we call repairs. For a Boolean query Q, the problem CERTAINTY(Q) asks whether every such repair satisfies the query; the problem is known to always be in coNP for conjunctive queries. However, there are queries for which it can be solved in polynomial time. It has been conjectured that there exists a dichotomy on the complexity of CERTAINTY(Q) for conjunctive queries: it is either in PTIME or coNP-complete. In this paper, we prove that the conjecture is indeed true for the case of conjunctive queries without self-joins, where each atom has as a key either a single attribute (simple key) or all attributes of the atom.
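The definitions of repair and CERTAINTY(Q) can be illustrated with a brute-force sketch. It assumes each relation is a set of tuples whose key is a prefix of the tuple, and it enumerates every repair, so it runs in exponential time and says nothing about the PTIME cases the paper identifies. All names (`repairs`, `certain`, the sample query) are hypothetical.

```python
from itertools import product

def repairs(relation, key_len):
    """Yield every repair of one relation: keep exactly one tuple per key value."""
    blocks = {}
    for t in relation:
        blocks.setdefault(t[:key_len], []).append(t)
    for choice in product(*blocks.values()):
        yield set(choice)

def certain(db, keys, query):
    """True iff `query` holds in every repair of `db` (exponential brute force).

    db maps relation names to sets of tuples, keys maps relation names to the
    length of the key prefix, and query is a predicate over a repaired database.
    """
    names = list(db)
    per_relation = [list(repairs(db[n], keys[n])) for n in names]
    return all(query(dict(zip(names, combo))) for combo in product(*per_relation))

def q(d):
    """Q = exists x, y: R(x, y) and S(y), with x the key of R."""
    return any((y,) in d["S"] for (_, y) in d["R"])

inconsistent_r = {(1, "a"), (1, "b")}  # two tuples share the key value 1
keys = {"R": 1, "S": 1}
```

With `S = {("a",), ("b",)}` both repairs of R satisfy Q, so the answer is certain; with `S = {("a",)}` the repair that keeps `(1, "b")` falsifies Q.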

arXiv.org e-Print Archive

CiteSeerX

Oblivious Bounds on the Probability of Boolean Functions

Authors: Gatterbauer Wolfgang, Suciu Dan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 21/09/2014

This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e. when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute, into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.

Comment: 34 pages, 14 figures, supersedes: http://arxiv.org/abs/1105.281
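A small numeric sketch of dissociation, under stated assumptions: the formula xy ∨ xz is dissociated on x into x1·y ∨ x2·z, the upper bound reuses the original probability p for both copies, and the lower bound uses 1 - (1 - p)^(1/2) for each copy. These particular probability choices reflect one reading of oblivious bounds for a disjunctive dissociation and should be checked against the paper; the brute-force `probability` helper and the sample numbers are hypothetical.

```python
from itertools import product

def probability(formula, probs):
    """Exact probability of a Boolean function by enumerating all assignments
    of independent variables with marginals `probs`."""
    names = list(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            weight = 1.0
            for n in names:
                weight *= probs[n] if world[n] else 1.0 - probs[n]
            total += weight
    return total

def original(w):
    # x occurs twice, so the two conjuncts are correlated.
    return (w["x"] and w["y"]) or (w["x"] and w["z"])

def dissociated(w):
    # Each occurrence of x becomes its own independent variable.
    return (w["x1"] and w["y"]) or (w["x2"] and w["z"])

p, qy, qz = 0.5, 0.6, 0.7
exact = probability(original, {"x": p, "y": qy, "z": qz})
upper = probability(dissociated, {"x1": p, "x2": p, "y": qy, "z": qz})
p_low = 1 - (1 - p) ** 0.5  # assumed lower-bound choice for two copies
lower = probability(dissociated, {"x1": p_low, "x2": p_low, "y": qy, "z": qz})
```

For these numbers the dissociated formula brackets the exact value: `lower <= exact <= upper`. The dissociated formula is read-once, which is what makes its probability easy to compute without enumeration in practice.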

arXiv.org e-Print Archive

CiteSeerX

The Dichotomy of Conjunctive Queries on Probabilistic Structures

Authors: Dalvi Nilesh, Suciu Dan
Publication date: 01/01/2007

We show that for every conjunctive query, the complexity of evaluating it on a probabilistic database is either PTIME or #P-complete, and we give an algorithm for deciding whether a given conjunctive query is PTIME or #P-complete. The dichotomy property is a fundamental result on query evaluation on probabilistic databases and it gives a complete classification of the complexity of conjunctive queries.
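The abstract does not describe the decision algorithm, so the sketch below is not the paper's method. It shows only the simpler, well-known special case for Boolean conjunctive queries without self-joins, where the tractable queries are the hierarchical ones: for any two variables, the sets of atoms containing them are nested or disjoint. The representation of a query as a mapping from relation names to variable sets is an assumption made for illustration.

```python
from itertools import combinations

def is_hierarchical(atoms):
    """atoms maps each relation name to the set of variables it uses.

    Returns True iff, for every pair of variables, the sets of atoms that
    mention them are nested or disjoint.
    """
    variables = set().union(*atoms.values())
    at = {v: {name for name, vs in atoms.items() if v in vs} for v in variables}
    return all(
        at[x] <= at[y] or at[y] <= at[x] or not (at[x] & at[y])
        for x, y in combinations(variables, 2)
    )

safe_query = {"R": {"x"}, "S": {"x", "y"}}                 # R(x), S(x, y)
hard_query = {"R": {"x"}, "S": {"x", "y"}, "T": {"y"}}     # R(x), S(x, y), T(y)
```

Here `safe_query` is hierarchical, while `hard_query` is the standard non-hierarchical example: x and y share the atom S, yet neither variable's atom set contains the other's.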

arXiv.org e-Print Archive

CiteSeerX

Crossref

How to Price Shared Optimizations in the Cloud

Authors: Balazinska Magdalena, Suciu Dan, Upadhyaya Prasang
Publication date: 01/01/2011

Data-management-as-a-service systems are increasingly being used in collaborative settings, where multiple users access common datasets. Cloud providers have the choice to implement various optimizations, such as indexing or materialized views, to accelerate queries over these datasets. Each optimization carries a cost and may benefit multiple users. This creates a major challenge: how to select which optimizations to perform and how to share their cost among users. The problem is especially challenging when users are selfish and will only report their true values for different optimizations if doing so maximizes their utility. In this paper, we present a new approach for selecting and pricing shared optimizations by using Mechanism Design. We first show how to apply the Shapley Value Mechanism to the simple case of selecting and pricing additive optimizations, assuming an offline game where all users access the service for the same time-period. Second, we extend the approach to online scenarios where users come and go. Finally, we consider the case of substitutive optimizations. We show analytically that our mechanisms induce truthfulness and recover the optimization costs. We also show experimentally that our mechanisms yield higher utility than the state-of-the-art approach based on regret accumulation.

Comment: VLDB201
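As a rough illustration of the offline case, the following sketch implements the textbook Shapley Value Mechanism for a single optimization with a fixed cost: the cost is split evenly among the remaining users, anyone whose bid falls below the current share is dropped, and the process repeats until the set is stable. How the paper adapts this to additive, online, and substitutive settings is not shown in the abstract; the function name and the bids are hypothetical.

```python
def shapley_value_mechanism(cost, bids):
    """Return (serviced users, price per user) for one optimization.

    Repeatedly split `cost` evenly among the remaining users and drop
    anyone whose bid is below that share, until no one else drops out.
    An empty set means the optimization is not implemented.
    """
    serviced = set(bids)
    while serviced:
        share = cost / len(serviced)
        keep = {u for u in serviced if bids[u] >= share}
        if keep == serviced:
            return serviced, share
        serviced = keep
    return set(), 0.0

# Hypothetical bids for an index that costs 100 to build.
funded = shapley_value_mechanism(100, {"ann": 60, "bob": 55, "cy": 20})
unfunded = shapley_value_mechanism(100, {"ann": 60, "bob": 45, "cy": 20})
```

In the first call `cy` drops out at a share of about 33.3, after which `ann` and `bob` each pay 50 and the cost is recovered exactly. In the second call `bob` cannot afford the share of 50, `ann` cannot afford 100 alone, and the optimization is not built.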

arXiv.org e-Print Archive

CiteSeerX

Communication Steps for Parallel Query Processing

Authors: Beame Paul, Koutris Paraschos, Suciu Dan
Publication date: 01/01/2013

We consider the problem of computing a relational query $q$ on a large input database of size $n$, using a large number $p$ of servers. The computation is performed in rounds, and each server can receive only $O(n/p^{1-\varepsilon})$ bits of data, where $\varepsilon \in [0,1]$ is a parameter that controls replication. We examine how many global communication steps are needed to compute $q$. We establish both lower and upper bounds, in two settings. For a single round of communication, we give lower bounds in the strongest possible model, where arbitrary bits may be exchanged; we show that any algorithm requires $\varepsilon \geq 1-1/\tau^*$, where $\tau^*$ is the fractional vertex cover of the hypergraph of $q$. We also give an algorithm that matches the lower bound for a specific class of databases. For multiple rounds of communication, we present lower bounds in a model where routing decisions for a tuple are tuple-based. We show that for the class of tree-like queries there exists a tradeoff between the number of rounds and the space exponent $\varepsilon$. The lower bounds for multiple rounds are the first of their kind. Our results also imply that transitive closure cannot be computed in O(1) rounds of communication.
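The one-round bound can be evaluated on small queries with the sketch below. It assumes every atom is binary, so the query hypergraph is an ordinary graph; for graphs the fractional vertex cover LP has a half-integral optimum, which lets a brute-force search over weights in {0, 1/2, 1} stand in for an LP solver. Queries with wider atoms would need a real LP. The function names and example queries are hypothetical.

```python
from fractions import Fraction
from itertools import product

def fractional_vertex_cover(atoms):
    """tau* for a query whose atoms are all binary (pairs of variables).

    Assigns each variable a weight in {0, 1/2, 1} so that the two variables
    of every atom sum to at least 1, and returns the minimum total weight.
    """
    variables = sorted({v for atom in atoms for v in atom})
    choices = (Fraction(0), Fraction(1, 2), Fraction(1))
    best = None
    for weights in product(choices, repeat=len(variables)):
        w = dict(zip(variables, weights))
        if all(w[a] + w[b] >= 1 for a, b in atoms):
            total = sum(weights)
            if best is None or total < best:
                best = total
    return best

def min_space_exponent(tau_star):
    """Lower bound on epsilon for one round: 1 - 1/tau*."""
    return 1 - 1 / tau_star

triangle = [("x", "y"), ("y", "z"), ("z", "x")]   # R(x,y), S(y,z), T(z,x)
chain = [("x", "y"), ("y", "z")]                  # R(x,y), S(y,z)
```

For the triangle query the cover puts weight 1/2 on each variable, giving $\tau^* = 3/2$ and $\varepsilon \geq 1/3$. For the two-atom chain, weight 1 on the shared variable gives $\tau^* = 1$ and $\varepsilon \geq 0$, so the bound forces no replication.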

arXiv.org e-Print Archive

CiteSeerX

Crossref